Project Overview

  • Goal: Construct a model that accurately predicts whether an individual makes more than $50K per year
  • Motivation: Help a non-profit identify donor candidates and understand how large a donation to request
  • Data Source: 1994 US Census data, from the UCI Machine Learning Repository*

*Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes to the dataset have been made, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.

EDA: Data Dictionary

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • education_level: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              45222 non-null  int64  
 1   workclass        45222 non-null  object 
 2   education_level  45222 non-null  object 
 3   education-num    45222 non-null  float64
 4   marital-status   45222 non-null  object 
 5   occupation       45222 non-null  object 
 6   relationship     45222 non-null  object 
 7   race             45222 non-null  object 
 8   sex              45222 non-null  object 
 9   capital-gain     45222 non-null  float64
 10  capital-loss     45222 non-null  float64
 11  hours-per-week   45222 non-null  float64
 12  native-country   45222 non-null  object 
 13  income           45222 non-null  object 
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
                 count  unique  top                 freq   mean     std      min  25%  50%  75%  max
age              45222  NaN     NaN                 NaN    38.5479  13.2179  17   28   37   47   90
workclass        45222  7       Private             33307  NaN      NaN      NaN  NaN  NaN  NaN  NaN
education_level  45222  16      HS-grad             14783  NaN      NaN      NaN  NaN  NaN  NaN  NaN
education-num    45222  NaN     NaN                 NaN    10.1185  2.55288  1    9    10   13   16
marital-status   45222  7       Married-civ-spouse  21055  NaN      NaN      NaN  NaN  NaN  NaN  NaN
occupation       45222  14      Craft-repair        6020   NaN      NaN      NaN  NaN  NaN  NaN  NaN
relationship     45222  6       Husband             18666  NaN      NaN      NaN  NaN  NaN  NaN  NaN
race             45222  5       White               38903  NaN      NaN      NaN  NaN  NaN  NaN  NaN
sex              45222  2       Male                30527  NaN      NaN      NaN  NaN  NaN  NaN  NaN
capital-gain     45222  NaN     NaN                 NaN    1101.43  7506.43  0    0    0    0    99999
capital-loss     45222  NaN     NaN                 NaN    88.5954  404.956  0    0    0    0    4356
hours-per-week   45222  NaN     NaN                 NaN    40.938   12.0075  1    40   40   45   99
native-country   45222  41      United-States       41292  NaN      NaN      NaN  NaN  NaN  NaN  NaN
income           45222  2       <=50K               34014  NaN      NaN      NaN  NaN  NaN  NaN  NaN
Number of observations: 45222
Number of people with income > 50k: 11208
Number of people with income <= 50k: 34014
Percent of people with income > 50k: 24.78
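
The class-balance figures above can be computed with a few lines of pandas. The following is a minimal sketch, assuming the label lives in a column named income; the toy frame and variable names are hypothetical, not the notebook's own code. The last line only re-checks the reported percentage from the two reported counts.

```python
import pandas as pd

# Hypothetical toy frame; the real data has 45,222 rows and an "income" column.
toy = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K", "<=50K"]})

n_records = len(toy)
n_greater_50k = int((toy["income"] == ">50K").sum())
n_at_most_50k = int((toy["income"] == "<=50K").sum())
greater_percent = 100 * n_greater_50k / n_records

print(f"Number of observations: {n_records}")
print(f"Number of people with income > 50k: {n_greater_50k}")
print(f"Number of people with income <= 50k: {n_at_most_50k}")
print(f"Percent of people with income > 50k: {greater_percent:.2f}")

# Arithmetic check of the reported figure: 11208 of 45222 records.
reported_percent = round(100 * 11208 / 45222, 2)
```

Roughly one record in four belongs to the positive class, so accuracy alone can look good for a model that always predicts <=50K.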

Data Engineering/Preprocessing

Before this data can be used for modeling and application to machine learning algorithms, it must be cleaned, formatted, and structured.

Eng: Factor Names

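Factor names containing special characters, such as -, can cause issues (e.g. in model formulas or attribute-style column access, - is read as subtraction), so renaming them with underscores is helpful.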
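
One way to do this cleaning is a plain string replacement on the column index. This is a minimal sketch on a hypothetical empty frame that reuses a few of the column names listed earlier; it is not necessarily how the notebook performed the renaming.

```python
import pandas as pd

# Hypothetical frame with only column names; no rows are needed to show the renaming.
toy = pd.DataFrame(columns=["age", "education-num", "hours-per-week", "native-country"])

# Replace every hyphen with an underscore (literal match, not a regex).
toy.columns = toy.columns.str.replace("-", "_", regex=False)

print(list(toy.columns))
```

After renaming, a column such as education_num can be referenced in a formula string or as toy.education_num without being parsed as a subtraction.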

Eng: Categorical Transformations

Working with categorical variables often involves transforming strings to numerical values: frequently 0 or 1 for binomial factors, and integer codes $0, 1, \ldots, n$ for multinomial factors with levels $x_{0}, x_{1}, \ldots, x_{n}$.

These values may be ordinal (i.e. values with relationships that can be compared as a ranking, e.g. worst, better, best), or nominal (i.e. values indicate a state, e.g. blue, green, yellow).

====================================
Mapping for variable: numeric_income
Factor Value Numerical Value
0 <=50K 0
1 >50K 1
=======================================
Mapping for variable: numeric_workclass
Factor Value Numerical Value
0 State-gov 0
1 Self-emp-not-inc 1
2 Private 2
3 Federal-gov 3
4 Local-gov 4
5 Self-emp-inc 5
6 Without-pay 6
============================================
Mapping for variable: numeric_marital_status
Factor Value Numerical Value
0 Never-married 0
1 Married-civ-spouse 1
2 Divorced 2
3 Married-spouse-absent 3
4 Separated 4
5 Married-AF-spouse 5
6 Widowed 6
========================================
Mapping for variable: numeric_occupation
Factor Value Numerical Value
0 Adm-clerical 0
1 Exec-managerial 1
2 Handlers-cleaners 2
3 Prof-specialty 3
4 Other-service 4
5 Sales 5
6 Transport-moving 6
7 Farming-fishing 7
8 Machine-op-inspct 8
9 Tech-support 9
10 Craft-repair 10
11 Protective-serv 11
12 Armed-Forces 12
13 Priv-house-serv 13
==========================================
Mapping for variable: numeric_relationship
Factor Value Numerical Value
0 Not-in-family 0
1 Husband 1
2 Wife 2
3 Own-child 3
4 Unmarried 4
5 Other-relative 5
==================================
Mapping for variable: numeric_race
Factor Value Numerical Value
0 White 0
1 Black 1
2 Asian-Pac-Islander 2
3 Amer-Indian-Eskimo 3
4 Other 4
=================================
Mapping for variable: numeric_sex
Factor Value Numerical Value
0 Male 0
1 Female 1
=============================================
Mapping for variable: numeric_education_level
Factor Value Numerical Value
0 Doctorate 0
1 Prof-school 1
2 Masters 2
3 Bachelors 3
4 Assoc-voc 4
5 Assoc-acdm 5
6 Some-college 6
7 HS-grad 7
8 12th 8
9 11th 9
10 10th 10
11 9th 11
12 7th-8th 12
13 5th-6th 13
14 1st-4th 14
15 Preschool 15
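
Mappings of this kind can be produced in more than one way. The sketch below shows two common options on hypothetical toy data: pd.factorize, which assigns codes in order of first appearance (suitable for nominal factors), and an ordered pd.Categorical, where the ranking is stated explicitly (suitable for ordinal factors). The shortened education order is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy data using a few factor values from the data dictionary.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "education_level": ["Bachelors", "HS-grad", "Doctorate", "HS-grad"],
})

# Nominal factor: codes follow order of first appearance, with no ranking implied.
codes, levels = pd.factorize(toy["sex"])
toy["numeric_sex"] = codes
sex_mapping = dict(zip(levels, range(len(levels))))

# Ordinal factor: the order is spelled out, so the codes carry the ranking.
# This three-level order is a shortened, hypothetical version of the full list.
education_order = ["Doctorate", "Bachelors", "HS-grad"]
toy["numeric_education_level"] = pd.Categorical(
    toy["education_level"], categories=education_order, ordered=True
).codes

print(sex_mapping)
print(toy)
```

Integer codes on a nominal factor impose an artificial ordering for models that treat inputs as numeric, which is worth keeping in mind when choosing a model.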

Eng: Data Separation

For training an algorithm, it is useful to separate the label, or dependent variable ($Y$), from the rest of the data, the training_features, or independent variables ($X$).
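
A minimal sketch of this separation, assuming the label is stored in a column named numeric_income (as in the mappings above); the toy values and the variable name label are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two features and the encoded label.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "hours_per_week": [40.0, 13.0, 40.0],
    "numeric_income": [0, 0, 1],
})

# Y: the dependent variable.
label = toy["numeric_income"]

# X: everything else. drop() returns a new frame and leaves toy unchanged.
training_features = toy.drop(columns=["numeric_income"])

print(training_features.columns.tolist())
print(label.tolist())
```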

Skew

The features capital_gain and capital_loss are positively skewed (i.e. have a long tail in the positive direction).

To reduce this skew, a logarithmic transformation can be applied. Because both features contain zeros, where $\ln\left(0\right)$ is undefined, a shifted form such as $\tilde x = \ln\left(x + 1\right)$ is needed. This transformation will reduce the variance and pull the mean closer to the center of the distribution.

Why does this matter: The extreme points may affect the performance of the predictive model.

Why care: We want an easily discernible relationship between the independent and dependent variables; the skew makes that more complicated.

Why DOESN'T this matter: Most models make no assumption about the distribution of the independent variables. Linear regression does make assumptions about the residuals, $u = Y - \hat{Y}$: a zero conditional mean, $E\left(u | x\right) = 0$, and homoskedasticity (i.e. constant variance of the residuals given the independent variables). In this analysis, however, the dependent variable is categorical (i.e. discrete or non-continuous), so linear regression is not an appropriate model.
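
The effect of a log transformation on skew can be sketched on toy data. The example assumes the shifted form ln(x + 1), via np.log1p, because the features contain zeros, and it uses pandas' sample skewness; both choices are assumptions rather than a confirmed description of the notebook's code, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values shaped like capital_gain: mostly zeros, a few large amounts.
capital_gain = pd.Series([0.0] * 8 + [2174.0, 99999.0])

# log1p computes ln(x + 1), which stays defined (and equal to 0) when x is 0.
log_capital_gain = np.log1p(capital_gain)

# Compare skewness, mean, and variance before and after the transformation.
summary = pd.DataFrame(
    {
        "Skewness": [capital_gain.skew(), log_capital_gain.skew()],
        "Mean": [capital_gain.mean(), log_capital_gain.mean()],
        "Variance": [capital_gain.var(), log_capital_gain.var()],
    },
    index=["Capital Gain", "Log Capital Gain"],
)
print(summary)
```

On this toy series both the skewness and the variance drop after the transformation, but the zeros remain a large spike at 0, so the result is still far from symmetric.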

   Feature       Skewness   Mean         Variance
0  Capital Loss  4.516154   88.595418    1.639858e+05
1  Capital Gain  11.788611  1101.430344  5.634525e+07
Optimization terminated successfully.
         Current function value: 0.692991
         Iterations 3
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         numeric_income   No. Observations:                45222
Model:                          Logit   Df Residuals:                    45221
Method:                           MLE   Df Model:                            0
Date:                Thu, 16 Apr 2020   Pseudo R-squ.:                 -0.2376
Time:                        22:57:36   Log-Likelihood:                -31338.
converged:                       True   LL-Null:                       -25322.
Covariance Type:            nonrobust   LLR p-value:                       nan
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
capital_loss  8.534e-05   2.28e-05      3.747      0.000    4.07e-05       0.000
================================================================================
Optimization terminated successfully.
         Current function value: 0.693117
         Iterations 3

Transformed model
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         numeric_income   No. Observations:                45222
Model:                          Logit   Df Residuals:                    45221
Method:                           MLE   Df Model:                            0
Date:                Thu, 16 Apr 2020   Pseudo R-squ.:                 -0.2378
Time:                        22:57:53   Log-Likelihood:                -31344.
converged:                       True   LL-Null:                       -25322.
Covariance Type:            nonrobust   LLR p-value:                       nan
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
capital_loss     0.0095      0.006      1.642      0.101      -0.002       0.021
================================================================================

Original model
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         numeric_income   No. Observations:                45222
Model:                          Logit   Df Residuals:                    45221
Method:                           MLE   Df Model:                            0
Date:                Thu, 16 Apr 2020   Pseudo R-squ.:                 -0.2376
Time:                        22:57:53   Log-Likelihood:                -31338.
converged:                       True   LL-Null:                       -25322.
Covariance Type:            nonrobust   LLR p-value:                       nan
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
capital_loss  8.534e-05   2.28e-05      3.747      0.000    4.07e-05       0.000
================================================================================

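The logarithmic transformation reduced the skew and the variance of each feature, though the reduction in skew was much larger for capital gain (11.79 to 3.08) than for capital loss (4.52 to 4.27).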

Feature           Skewness   Mean         Variance
Capital Loss      4.516154   88.595418    163985.81018
Capital Gain      11.788611  1101.430344  56345246.60482
Log Capital Loss  4.271053   0.355489     2.54688
Log Capital Gain  3.082284   0.740759     6.08362

Eng: Normalization and Standardization

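These two terms, normalization and standardization, are frequently used interchangeably, but they serve two different scaling purposes.

  • Normalization: rescale values to lie between 0 and 1, i.e. $x' = \left(x - x_{min}\right) / \left(x_{max} - x_{min}\right)$
  • Standardization: rescale values to have mean 0 and variance 1, i.e. $z = \left(x - \mu\right) / \sigma$; this changes the location and scale of the data, but does not by itself make it follow a normal distribution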
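
The two scalings in the list above can be written directly in NumPy. This is a minimal sketch on hypothetical age values, where 17 and 90 are the minimum and maximum age reported in the summary table; it is not the notebook's own scaling code.

```python
import numpy as np

# Hypothetical ages; 17 and 90 match the min and max of age in the summary table.
ages = np.array([17.0, 39.0, 50.0, 90.0])

# Normalization (min-max scaling): the minimum maps to 0 and the maximum to 1.
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
standardized = (ages - ages.mean()) / ages.std()

print(normalized)
print(standardized)
```

With this range, an age of 39 normalizes to 22/73, or about 0.301370. The standardized values have mean 0 and variance 1, but their distribution keeps its original shape.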

Earlier, capital_gain and capital_loss were transformed logarithmically, reducing their skew, and affecting the model's predictive power (i.e. ability to discern the relationship between the dependent and independent variables).

Another method of influencing the model's predictive power is normalization of the numerical independent variables. After normalization, every feature lies on the same scale, so no feature dominates the model merely because of its units.

However, after scaling is applied, observing the data in its raw form will no longer have the same meaning as before.

age workclass education_level education_num marital_status occupation relationship race sex capital_gain ... native_country numeric_income numeric_workclass numeric_marital_status numeric_occupation numeric_relationship numeric_race numeric_sex numeric_native_country numeric_education_level
0 0.301370 State-gov Bachelors 0.800000 Never-married Adm-clerical Not-in-family White Male 0.667492 ... United-States 0 0 0 0 0 0 0 0 3
1 0.452055 Self-emp-not-inc Bachelors 0.800000 Married-civ-spouse Exec-managerial Husband White Male 0.000000 ... United-States 0 1 1 1 1 0 0 0 3
2 0.287671 Private HS-grad 0.533333 Divorced Handlers-cleaners Not-in-family White Male 0.000000 ... United-States 0 2 2 2 0 0 0 0 7
3 0.493151 Private 11th 0.400000 Married-civ-spouse Handlers-cleaners Husband Black Male 0.000000 ... United-States 0 2 1 2 1 1 0 0 9
4 0.150685 Private Bachelors 0.800000 Married-civ-spouse Prof-specialty Wife Black Female 0.000000 ... Cuba 0 2 1 3 2 1 1 1 3

5 rows × 22 columns
